Abstract. Recurrent neural networks are often used for learning
time-series data. Based on a few assumptions we model this learning
task as a minimization problem of a nonlinear least-squares cost
function. The special structure of the cost function allows us to
build a connection to reinforcement learning. We exploit this
connection and derive a convergent, policy iteration-based algorithm.
Furthermore, we argue that RNN training can be fit naturally into the
reinforcement learning framework.

1. Introduction

Recurrent neural networks (RNNs) are attractive tools for the learning
of time series. However, traditional long-term prediction methods
either work as iterated 1-step methods (Fig. 1(a)) or by direct
learning of the k-step-ahead value (Fig. 1(b)). There is a vast amount
of literature on RNNs, which is reviewed in the accompanying paper.^1

In the former scheme, during the learning phase, prediction errors are
computed relative to the previous sequence values, so errors do not
accumulate. However, this does not capture well the behavior in the
testing phase, when errors cannot be corrected step-by-step, so they
accumulate rather quickly. The latter scheme escapes this trap, but it
needs a different estimator for each lookahead time k, which is far
from economical (although implementations exist, see, e.g., (Duhoux et
al., 2001)). We take a novel approach: perform iterated prediction
without correction, and formulate an objective function that directly
takes all the prediction errors into account (see Fig. 1(c)).

^1 This technical report is supplementary material for the manuscript
'PIRANHA: Policy Iteration for Recurrent Artificial Neural Networks
with Hidden Activities' written by I. Szita and A. Lorincz. The
interested reader is kindly referred to a recent review on this topic
(Bone and Crucianu, 2002) and references therein.



[Figure 1: panels labeled "iterated 1-step prediction", "single k-step
prediction" and "single sequence replay prediction".]

Figure 1. Comparison of time series learning methods. Black dots: the
original time series; dotted arrows: predictions; gray arrows:
corrections during learning.

The resulting objective function is a least-squares function of highly 
nonlinear terms, being intractable for traditional minimization tech- 
niques. We proceed by showing that the task can be interpreted as a 
reinforcement learning problem of a hypothetical agent in an abstract 
environment. This connection enables the minimization of our objec- 
tive function using a version of policy iteration. 

The outline of the paper is as follows. The theoretical part is made of
three sections. First, we define the learning task and derive the
learning rules of our algorithm in Section 2. In Section 3 the
optimization problem is rewritten into a Policy Iteration Algorithm for
Recurrent Artificial (neural) Networks with Hidden Activities
(PIRANHA).^2 Using this novel form, we give a convergence proof of
PIRANHA in Sections 4 and 5. In Section 6 we argue that considering RNN
learning as an RL process is consistent and fits well into the
"classical" RL framework and into ongoing recent efforts in RL. The
last section summarizes our results.

2. Definitions and Basic Concepts 

2.1. The Network Architecture. We consider fully connected recurrent
neural networks with $m$ input neurons and $n > m$ hidden neurons. We do
not use a separate output layer; the states of the first $m$ hidden
neurons are considered as outputs. The state of each neuron is a real
number in the interval $[-1, +1]$, because neurons apply an activation
(squashing) function $\sigma$, which maps onto the interval $[-1, +1]$.
An example is the function $\sigma(z) = \tanh(z)$. The input, the output
and the hidden state at time $t$ are denoted by $v_t \in [-1,+1]^m$,
$\hat{x}_{t+1} \in [-1,+1]^m$ and $u_t \in [-1,+1]^n$, respectively. No
explicit bias term is applied; instead, a constant 1 might make up the
$(m+1)$-th component of the input. Let the recurrent, the input and the
output weight matrices be denoted by $F \in \mathbb{R}^{n \times n}$,
$G \in \mathbb{R}^{n \times (m+1)}$, and $H \in \mathbb{R}^{m \times n}$,
respectively. Let $H$ be a fixed matrix projecting to the first $m$
components of $u_t$. The dynamics of the network is as follows:

(1)  $u_{t+1} = \sigma(F u_t + G v_t)$,

(2)  $\hat{x}_{t+1} = H u_{t+1}$,

with the initial condition $u_0 = 0$.

^2 PIRANHA can be viewed as a gradient based approach. The purpose of
this technical report is to extend the gradient based description of
the accompanying paper to the framework of RL and to provide insight
into why and how PIRANHA works.
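The dynamics of Eqs. (1)-(2) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `run_rnn`, the random weights, the sine input and the choice of tanh as $\sigma$ are not from the report, and the inputs are teacher-forced ($v_t := x_t$).

```python
import numpy as np

def run_rnn(F, G, xs):
    """Teacher-forced pass of Eqs. (1)-(2): feed x_t, collect states and outputs.

    F: (n, n) recurrent weights, G: (n, m + 1) input weights,
    xs: (T, m) input sequence with entries in [-1, 1].
    Returns states us of shape (T + 1, n) with us[0] = 0, and outputs
    x_hat of shape (T, m) with x_hat[t] = H us[t + 1].
    """
    n, m = F.shape[0], xs.shape[1]
    H = np.eye(m, n)                      # fixed projection onto the first m units
    us = np.zeros((len(xs) + 1, n))       # initial condition u_0 = 0
    for t, x in enumerate(xs):
        v = np.append(x, 1.0)             # constant 1 as the (m+1)-th input component
        us[t + 1] = np.tanh(F @ us[t] + G @ v)
    return us, us[1:] @ H.T

# Hypothetical example data (assumed, for illustration only).
rng = np.random.default_rng(0)
n, m, T = 5, 2, 8
F = 0.5 * rng.standard_normal((n, n))
G = 0.5 * rng.standard_normal((n, m + 1))
xs = np.sin(0.3 * np.arange(T))[:, None] * np.ones((1, m))
us, x_hat = run_rnn(F, G, xs)
```

Because $H$ only selects the first $m$ hidden units, the output is simply a slice of the hidden state, which is why no separate output layer is needed.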

2.2. Problem Description. Suppose that a time series
$\{x_t \in [-1,+1]^m \mid t = 1, 2, 3, \ldots\}$ and a sequence length
$T$ are given. We have to find weight matrices $F$ and $G$ so that the
network is able to replay the first $T$ elements of the sequence in
correct temporal order, i.e. for any $1 \le t \le T$, having input
$v_{t'} := x_{t'}$ for $t' \le t$ and generating
$v_{t'} := \hat{x}_{t'}$ for $t < t' \le T$, the total squared error
$\sum_{t'=t}^{T} \|\hat{x}_{t'} - x_{t'}\|^2$ is small.
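The replay task can be illustrated with a small sketch that feeds the true sequence for the first `t` steps and then lets the network run on its own outputs. The function name `replay_error`, the zero-based indexing, the omission of the bias input (so that `G` has shape `(n, m)`) and the example data are assumptions made for brevity, not part of the report.

```python
import numpy as np

def replay_error(F, G, xs, t):
    """Total squared error on xs[t:] when only xs[:t] is shown to the network.

    For the first t steps the true value is the input; afterwards the
    network's own previous output H u is fed back (no correction).
    """
    n, m = F.shape[0], xs.shape[1]
    H = np.eye(m, n)
    u = np.zeros(n)
    err = 0.0
    for step in range(len(xs) - 1):
        v = xs[step] if step < t else H @ u      # true input first, own prediction later
        u = np.tanh(F @ u + G @ v)
        if step + 1 >= t:                        # only the replayed part is scored
            err += np.sum((H @ u - xs[step + 1]) ** 2)
    return err

# Hypothetical example data (assumed, for illustration only).
rng = np.random.default_rng(0)
n, m = 4, 1
F = 0.5 * rng.standard_normal((n, n))
G = 0.5 * rng.standard_normal((n, m))
xs = np.sin(0.5 * np.arange(10))[:, None]
errors = [replay_error(F, G, xs, t) for t in range(1, len(xs) + 1)]
```

Each entry of `errors` corresponds to one split point `t`; the replay capability discussed in the text concerns all of them together.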

The sum of the squared errors over all possible $t'$ values
characterizes the replay capability of the network for the given
network weights. However, this cost function may have pitfalls for
ordinary gradient-based methods, because small changes in the weights
may considerably influence the output many steps ahead. Sensitivity to
the weights may become a crucial issue under this condition.

Our key observation is that the same problem emerges in reinforcement
learning (RL). We shall reformulate the learning task to highlight this
connection, and we shall apply RL algorithms to solve our optimization
problem.

2.3. Constructing an Appropriate Cost Function. The traditional
approach would be to search for weight matrices $F$ and $G$ that
minimize the reconstruction error

(3)  $\sum_{t=0}^{T} \|\hat{x}_t - x_t\|^2 = \sum_{t=0}^{T} \|H u_t - x_t\|^2$

subject to the constraints (1)-(2). It is well known that such
recurrent network optimization tasks are hard, because the objective
function has many local minima and is often ill-conditioned. Let us
investigate some possible reasons: consider the case when all the
predicted output values $\hat{x}_t$ are correct, except for a single
time step $t_0$. The hidden state for this time step is also bad, and
we can modify it only by modifying the weights $F$ and $G$. However,
this modification is likely to compromise all the other errors,
resulting in a total increase in the objective value. The reason for
this phenomenon is that the hidden states for different time steps are
very strongly coupled by Eq. (1).

The underlying idea of our approach is that if this coupling is
relaxed, the resulting error surface may become smoother. Naturally, in
the end we have to enforce Eq. (1), but first let us consider a
different set of constraints. Suppose that the sequence
$U = (u_0, u_1, u_2, \ldots)$ is an arbitrary state sequence, not
necessarily belonging to an RNN. When can we say that matrices $F$, $G$
and a state $u_t$ in the sequence form a good predictor? The one-step
prediction from $u_t$ is

(4)  $\hat{x}_{t \to (t+1)} = H \sigma(F u_t + G x_t)$,

for two steps it is

(5)  $\hat{x}_{t \to (t+2)} = H \sigma(F \sigma(F u_t + G x_t) + G \hat{x}_{t \to (t+1)})$
     $= H \sigma((F + GH) \sigma(F u_t + G x_t))$,

and so on. Let us introduce the notation

(6)  $s^{(1)}_{(F,G)}(u) := \sigma(F u + G x_t)$

for the effect of $(F, G)$ on a single RNN state $u \in \mathbb{R}^n$
and

(7)  $s^{(k)}_{(F,G)}(u) := \sigma((F + GH) s^{(k-1)}_{(F,G)}(u))$

to describe the effect of applying $(F, G)$ $k$ ($> 1$) times on state
$u$. For later use we also define $s^{(0)}$ as the identity function.

Using these notations, the $k$-step prediction error of state $u_t$ is

(8)  $H s^{(k)}_{(F,G)}(u_t) - x_{t+k}$.
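The iterated map of Eqs. (6)-(7) and the identity in Eq. (5) can be checked numerically with a short sketch. The function name `s_k` and the random test data are assumptions; so is the simplification that the bias input is dropped, so that `G` has shape `(n, m)` and the product `G @ H` is well defined.

```python
import numpy as np

def s_k(F, G, u, x_t, k, sigma=np.tanh):
    """s^{(k)}_{(F,G)}(u) of Eqs. (6)-(7); s^{(0)} is the identity."""
    if k == 0:
        return u
    H = np.eye(x_t.shape[0], F.shape[0])
    state = sigma(F @ u + G @ x_t)            # Eq. (6): first step driven by x_t
    for _ in range(k - 1):
        state = sigma((F + G @ H) @ state)    # Eq. (7): own output fed back as input
    return state

# Hypothetical example data (assumed, for illustration only).
rng = np.random.default_rng(0)
n, m = 4, 2
F = 0.5 * rng.standard_normal((n, n))
G = 0.5 * rng.standard_normal((n, m))
H = np.eye(m, n)
u = np.tanh(rng.standard_normal(n))
x = np.tanh(rng.standard_normal(m))

# Left-hand form of Eq. (5): feed the one-step prediction back explicitly.
u1 = np.tanh(F @ u + G @ x)
two_step_explicit = H @ np.tanh(F @ u1 + G @ (H @ u1))
# Right-hand form of Eq. (5), via the operator notation.
two_step_operator = H @ s_k(F, G, u, x, 2)
```

Both forms agree because feeding the prediction $H u$ back through $G$ is the same as adding $GH$ to the recurrent matrix.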

(For the sake of notational simplicity, we assume that the input
sequence $x_t$ is defined for $t > T$ as well.) The overall cost of
state $u_t$ should be the sum of the $k$-step prediction error norms,
for all time instants and for all $k$. To ensure convergence, we use a
geometrically decaying weight sequence with decay rate $\gamma$. So the
total cost of the state sequence $U$ is

(9)  $J(U, F, G) = \sum_{k=0}^{\infty} \sum_{t=1}^{T} \gamma^k \|H s^{(k)}_{(F,G)}(u_t) - x_{t+k}\|^2$.
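A truncated version of the cost in Eq. (9) can be evaluated directly. The sketch below assumes a finite horizon `K` in place of the infinite sum, zero-based indexing (row `t` of `U` is paired with `xs[t]`), no bias input, and tanh as $\sigma$; the name `total_cost` and the example data are hypothetical.

```python
import numpy as np

def total_cost(U, F, G, xs, gamma, K):
    """Eq. (9) with the sum over k truncated at K.

    U: (T, n) state sequence, xs: (T + K, m) targets (defined beyond T).
    """
    T, n = U.shape
    H = np.eye(xs.shape[1], n)
    cost = 0.0
    for t in range(T):
        state = U[t]                                   # s^{(0)} is the identity
        for k in range(K + 1):
            cost += gamma**k * np.sum((H @ state - xs[t + k]) ** 2)
            if k == 0:
                state = np.tanh(F @ state + G @ xs[t])     # Eq. (6)
            else:
                state = np.tanh((F + G @ H) @ state)       # Eq. (7)
    return cost

# Hypothetical example data (assumed, for illustration only).
rng = np.random.default_rng(0)
n, m, T, K, gamma = 4, 1, 6, 3, 0.5
F = 0.5 * rng.standard_normal((n, n))
G = 0.5 * rng.standard_normal((n, m))
U = np.tanh(rng.standard_normal((T, n)))
xs = np.sin(0.5 * np.arange(T + K))[:, None]
J = total_cost(U, F, G, xs, gamma, K)
```

With `gamma = 0` only the `k = 0` terms survive, so the cost collapses to the plain reconstruction error of the given states.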

Note that if a set of weights $(F, G)$ minimizes Eq. (3) subject to
Eq. (1), it also minimizes Eq. (9), and technical assumptions can
ensure that the converse also holds. Note also that the cost function
in Eq. (9) has a large number of partial sequences and so it has an
enormous number of adjustable quantities. This gives one great freedom
in accomplishing the minimization. However, there is a price to pay:
minimization is ill-posed, because the number of parameters is much
larger than the number of data points. We restrict the choices to the
parameters of the original problem, but in a way which is different
from Eq. (1), and which reflects the structure of the sequence replay
problem better.

To this end, note that if a particular state $u_t$ is a good predictor
at time step $t$ using weights $(F, G)$, then $\sigma(F u_t + G x_t)$
is a good predictor at time step $t + 1$, following from the structure
of the cost function. Let

(10)  $S_{(F,G)}U := (u'_0, u'_1, \ldots, u'_T)$, where
      $u'_0 = 0$,
      $u'_{t+1} = \sigma(F u_t + G x_t)$ for $0 \le t < T$.

Using this notation the above statement says that if $U$ is a good
predictor with $(F, G)$, $S_{(F,G)}U$ is also a good predictor. This
argument can be applied iteratively, yielding that $S^k_{(F,G)}U$ is a
good predictor for $k \ge 1$. However, it is easy to see that for
$k > T$, $S^k_{(F,G)}U$ is equal to the state sequence generated by
(1), and will be denoted by $U_{(F,G)}$:

(11)  $U_{(F,G)} := (u_0, u_1, \ldots, u_T)$, where
      $u_0 = 0$,
      $u_{t+1} = \sigma(F u_t + G x_t)$ for $0 \le t < T$.
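The claim that repeated application of $S_{(F,G)}$ ends in the RNN's own state sequence can be verified with a small sketch. The names `apply_S` and `rnn_states`, the omission of the bias input and the random data are assumptions made for illustration.

```python
import numpy as np

def apply_S(F, G, U, xs):
    """One application of S_{(F,G)} (Eq. 10): u'_0 = 0, u'_{t+1} = tanh(F u_t + G x_t)."""
    U_next = np.zeros_like(U)
    U_next[1:] = np.tanh(U[:-1] @ F.T + xs @ G.T)
    return U_next

def rnn_states(F, G, xs):
    """U_{(F,G)} (Eq. 11): the state sequence generated by the RNN itself."""
    U = np.zeros((len(xs) + 1, F.shape[0]))
    for t in range(len(xs)):
        U[t + 1] = np.tanh(F @ U[t] + G @ xs[t])
    return U

# Hypothetical example data (assumed, for illustration only).
rng = np.random.default_rng(0)
n, m, T = 4, 1, 6
F = 0.5 * rng.standard_normal((n, n))
G = 0.5 * rng.standard_normal((n, m))
xs = np.sin(0.5 * np.arange(T))[:, None]       # x_0 ... x_{T-1}
U = rng.uniform(-1.0, 1.0, size=(T + 1, n))    # arbitrary start, u_0 not even zero

for _ in range(T + 1):                          # k = T + 1 > T applications
    U = apply_S(F, G, U, xs)
U_rnn = rnn_states(F, G, xs)
```

Each application fixes one more leading entry of the sequence, so an arbitrary start is fully overwritten after `T + 1` applications, and `U_rnn` is a fixed point of `apply_S`.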

This justifies the following improvement scheme:

• fix $U$, $F_0$, $G_0$

• find $(F, G)$ for which $J(S_{(F,G)}U, F_0, G_0)$ is small

• continue with $U := U_{(F,G)}$, $F_0 := F$ and $G_0 := G$.

In Sections 4 and 5 we will formally prove the correctness of the
method, but before doing so, we elaborate on some of the details.

First of all, note that the number of adjustable parameters is the same
as in the original method, but the parameters $F$ and $G$ play a
different role: they are chosen so that they minimize the multi-step
prediction errors, and relation (1) is part of the full optimization
problem.^3

^3 Several numerical simulations showed that the step $U := U_{(F,G)}$
could be substituted by $U := S_{(F,G)}U$ or $U := S^k_{(F,G)}U$ for
any $k$, and under such conditions constraint (1) does not appear at
all. However, theoretical analysis is easier for the definition that
includes Eq. (1), and this definition will be used here.
2.4. Computing the Gradient. In this subsection we work out the details
of the above scheme.

Let us denote the state sequence at iteration $i$ by $U_i$, and the
weight matrices by $F_i$ and $G_i$, respectively. We would like to find
$F$ and $G$ so that

(12)  $J'(F,G) := J(S_{(F,G)}U_i, F_i, G_i) = \sum_{t=1}^{T} \sum_{k=1}^{\infty} \gamma^k \|e_{t,k}\|^2$

is minimized, where $e_{t,k}$ denotes the $k$-step prediction error at
time $t$; that is, we are minimizing the cost by propagating all states
$u_t$ by $(F, G)$ once, and then by propagating with $(F_i, G_i)$
further.

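In order to do this, we have to compute the gradient of the cost
function with respect to the weights. For any $1 \le a, b, c \le n$ and
$1 \le d \le m + 1$, the gradients are given by

(13)  $\frac{\partial J'}{\partial F_{ab}}(F_i, G_i) = \sum_{t=1}^{T} \sum_{k=1}^{\infty} \ldots$

and

(14)  $\frac{\partial J'}{\partial G_{cd}}(F_i, G_i) = \sum_{t=1}^{T} \sum_{k=1}^{\infty} \gamma^k (e_{t,k})^{\top} H \, [m(k)] \ldots$

where

(15)  $\ldots$

$^{\top}$ denotes transposition and $[v]_b$ denotes the $b$-th component
of vector $v$.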



If we can ensure that the terms $m(k)$ remain bounded, then the above
sums are convergent, because all terms can be bounded by
$C \cdot \gamma^k$. This means that if $K$ is a sufficiently large
positive number, then the gradients of

(16)  $\hat{J}(F,G) := \sum_{t=1}^{T} \sum_{k=1}^{K} \gamma^k \|e_{t,k}\|^2$

will be arbitrarily close to (13) and (14), so we can use it as an
approximation.
Using the above formulae, we can obtain weight matrices better than
$(F_i, G_i)$ by taking a gradient descent step. The description of the
algorithm is therefore complete. We have summarized it in Fig. 2. In
the next section we will show that the algorithm can be seen as policy
iteration of an RL problem, which justifies its name: Policy Iteration
for Recurrent Artificial (neural) Networks with Hidden Activities, or
PIRANHA for short.

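input: $\alpha$, $K$, $T$, $(x_1, \ldots, x_T)$;
initialize $F_0$, $G_0$; $i := 0$;
for i := 1 to max_iter
    $U_i := U_{(F_i,G_i)}$;
    for t := 1 to T
        $m(1) := F_i \ldots$
        for k = 1 ... (K - 1),
            ...
        end
    ...

Figure 2. Pseudo-code of the PIRANHA algorithm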
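The overall loop can be sketched as follows. This is not the authors' implementation: it replaces the analytic gradients (13)-(14) with central finite differences, drops the bias input, uses tanh as $\sigma$, and adds a backtracking step-size rule so that each accepted step cannot increase the truncated cost. The time indexing of the lookahead cost (one step with $(F, G)$ from `U[t]`, then free-running steps with $(F_i, G_i)$) is an assumed reading of Eq. (12), and all function names and default values are hypothetical.

```python
import numpy as np

def rnn_states(F, G, xs):
    """Policy evaluation: the state sequence U_{(F,G)} generated by the RNN."""
    U = np.zeros((len(xs) + 1, F.shape[0]))
    for t in range(len(xs)):
        U[t + 1] = np.tanh(F @ U[t] + G @ xs[t])
    return U

def lookahead_cost(F, G, F_i, G_i, U, xs, gamma, K):
    """Truncated J'(F, G): one step with (F, G), then steps with (F_i, G_i)."""
    n, m = F.shape[0], xs.shape[1]
    H = np.eye(m, n)
    A_i = F_i + G_i @ H
    cost = 0.0
    for t in range(len(xs) - K):              # start only where K targets exist
        state = np.tanh(F @ U[t] + G @ xs[t])
        for k in range(1, K + 1):
            cost += gamma**k * np.sum((H @ state - xs[t + k]) ** 2)
            state = np.tanh(A_i @ state)      # free-running continuation
    return cost

def numerical_gradient(f, W, eps=1e-6):
    """Central finite differences, standing in for the analytic gradient."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (f(W_plus) - f(W_minus)) / (2 * eps)
    return grad

def piranha(xs, n, gamma=0.5, K=3, alpha=0.1, max_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    m = xs.shape[1]
    F = 0.1 * rng.standard_normal((n, n))
    G = 0.1 * rng.standard_normal((n, m))
    steps = []                                # (cost before, cost after) per iteration
    for _ in range(max_iter):
        F_i, G_i = F.copy(), G.copy()
        U = rnn_states(F_i, G_i, xs)          # policy evaluation

        def cost(A, B):
            return lookahead_cost(A, B, F_i, G_i, U, xs, gamma, K)

        before = cost(F_i, G_i)
        dF = numerical_gradient(lambda W: cost(W, G_i), F_i)
        dG = numerical_gradient(lambda W: cost(F_i, W), G_i)
        step = alpha
        for _attempt in range(20):            # backtracking (an added assumption)
            F_new, G_new = F_i - step * dF, G_i - step * dG
            if cost(F_new, G_new) <= before:
                F, G = F_new, G_new           # gradual policy improvement
                break
            step *= 0.5
        steps.append((before, cost(F, G)))
    return F, G, steps

# Hypothetical usage on a short sine wave.
xs = np.sin(0.4 * np.arange(30))[:, None]
F, G, steps = piranha(xs, n=4)
```

The backtracking loop is one practical way to pick a sufficiently small learning rate; a fixed small `alpha` would follow the pseudo-code more literally.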

3. Interpreting PIRANHA as Policy Iteration 

Notice that the cost function (9) is formally very similar to the cost
function of a reinforcement learning problem. Furthermore, the proposed
algorithmic solution is very much like policy iteration, a widely used
algorithm for solving RL problems. Although this similarity is only
formal, as in our case uncertain transitions are not considered, the
policy iteration formalism can be matched perfectly, and this gives us
valuable insight into how PIRANHA works. Furthermore, the similarity
enables us to prove that under appropriate conditions the algorithm is
convergent.

Firstly, we give a brief overview of the reinforcement learning
framework and policy iteration. Next we reformulate the sequence
learning problem as a special case of RL, and point out some important
differences from traditional RL problems. Finally, using the policy
iteration formalism, we present our main result about the convergence
of PIRANHA.



3.1. The Reinforcement Learning Framework and Policy Iteration. RL
deals with agents that make decisions, i.e., select actions in a
stochastic environment. The state of the environment is influenced by
the decisions of the agent. From the point of view of the agent, the
actual state is (fully or partially) observed, actions are selected and
state-dependent immediate costs are received. RL aims to minimize the
total cumulated cost by finding the optimal decisions (optimal policy)
for the agent.

In most cases, RL problems are treated as Markov decision processes,
i.e., states are fully observable and rewards and successor states
depend only on the current state and action but not on the history. For
the sake of simplicity, we also assume that costs are bounded and
deterministic, and depend only on the current state. Let us denote the
state space and the actions of the agent by $S$ and $A$, respectively.
The dynamics of the environment is characterized by the transition
probability function $P : S \times A \times S \to [0,1]$ and the
immediate cost function $c : S \to \mathbb{R}$. A deterministic policy
$\pi : S \to A$ of the agent is a mapping from states to actions.
Future costs to be cumulated may be discounted by some factor
$0 \le \gamma < 1$, so the expected cost of a state $s_0$ is

(17)  $J^{\pi}(s_0) := E\left( \sum_{t=0}^{\infty} \gamma^t c(s_t) \right)$,

where $E(\cdot)$ denotes expectation, and for each $t \ge 0$,
$a_t = \pi(s_t)$ and the system arrives at $s_{t+1}$ with probability
$P(s_t, a_t, s_{t+1})$. The task is to find an optimal policy $\pi^*$,
for which $J^{\pi^*}(s) \le J^{\pi}(s)$ for any state $s$ and any
policy $\pi$.

Policy iteration approaches the optimal policy by a two-phase iteration
procedure: at each iteration $i$, the current policy $\pi_i$ is first
evaluated, i.e. $J^{\pi_i}(s)$ is computed (or approximated), usually
by letting the agent take many steps using $\pi_i$ and processing the
experienced costs. The other phase is called policy improvement. In
this phase the policy is modified by using the inequality

(18)  $J^{\pi_i}(s) = c(s) + \gamma \sum_{s'} P(s, \pi_i(s), s') J^{\pi_i}(s')$

(19)  $\ge \min_{a} \left( c(s) + \gamma \sum_{s'} P(s, a, s') J^{\pi_i}(s') \right)$,

which implies that

(20)  $\pi_{i+1}(s) := \arg\min_{a} \left( c(s) + \gamma \sum_{s'} P(s, a, s') J^{\pi_i}(s') \right)$

is an improvement over $\pi_i$.^4 The improvement can be gradual: there
is no need to find the minimum, but it is sufficient if a partial
and/or approximate step is made in that direction.
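For a lookup-table problem the two phases can be written out in a few lines. The sketch below is a generic tabular policy iteration for cost minimization under the assumptions of this subsection (state-dependent costs, known $P$ and $c$); the function name, the exact linear-system evaluation step and the three-state chain used as an example are illustrative choices, not taken from the report.

```python
import numpy as np

def policy_iteration(P, c, gamma, max_iter=100):
    """P: (S, A, S) transition probabilities, c: (S,) state costs, 0 <= gamma < 1."""
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iter):
        # Policy evaluation: solve J = c + gamma * P_pi J exactly.
        P_pi = P[np.arange(n_states), policy]              # (S, S)
        J = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c)
        # Policy improvement, Eq. (20): greedy with respect to J.
        Q = c[:, None] + gamma * (P @ J)                   # (S, A)
        new_policy = np.argmin(Q, axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, J

# Hypothetical 3-state chain: action 0 stays, action 1 moves right;
# the last state is cost-free and absorbing.
P = np.zeros((3, 2, 3))
for s in range(3):
    P[s, 0, s] = 1.0
    P[s, 1, min(s + 1, 2)] = 1.0
c = np.array([1.0, 1.0, 0.0])
policy, J = policy_iteration(P, c, gamma=0.9)
```

In this example the agent learns to move right until it reaches the cost-free state; in the absorbing state both actions are equivalent and `argmin` keeps action 0.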

If transition probabilities, policies and cost function values are
stored in a lookup-table, with separate entries for each different
argument, policy iteration converges to an optimal policy under
appropriate conditions. This is true even if approximations are used in
either step or $P$ and $c$ are not known in advance but have to be
estimated from the agent-environment interaction. For details about the
conditions of the various convergence results, see (Bertsekas and
Tsitsiklis, 1996; Sutton and Barto, 1998).

In some cases, the construction of the lookup-table is not feasible,
e.g., when the state and/or the action spaces are continuous. Then one
needs to resort to function approximation methods. It has been shown,
however, that policy iteration for function approximators can be
divergent even for the simplest cases (Bertsekas and Tsitsiklis, 1996)
and, in turn, the proof of convergence becomes a central issue.

3.2. Fitting PIRANHA into the RL Framework. 

3.2.1. The Agent and its Environment. Let us define a hypothetical
agent that observes the state sequence $U$, and based on that, chooses
a set of weights $(F, G)$, which is used for prediction. Formally this
corresponds to $S = \mathbb{R}^{n \times T}$, so that a state $U \in S$
of the agent is the whole sequence of hidden states of the RNN:
$U = (u_1, u_2, \ldots, u_T)$, while the action space will be
$A := \mathbb{R}^{n \times n} \times \mathbb{R}^{n \times (m+1)}$. The
environment of this hypothetical agent is quite simple: if the agent
makes an action $(F, G)$ in state $U$, then the environment transfers
it deterministically to the successor state $S_{(F,G)}U$.

In a general-case RL problem, a policy of the agent is a mapping
$\pi : S \to A$. However, in our problem, time independent weight
matrices ($F$ and $G$) are searched for, therefore we restrict the
mapping to constant policies. Such policies $\pi_{(F,G)}$ execute the
fixed action $(F, G)$ regardless of the current state. For a policy
$\pi = (F, G)$ we will also use the notation $S_{\pi}(\cdot)$ instead
of $S_{(F,G)}(\cdot)$.

3.2.2. Costs. Let $U = (u_1, u_2, \ldots, u_T)$ be an arbitrary state.
Its immediate cost $c(U)$ is defined as the error of reconstructing the
input series $x_t$:

(21)  $c(U) := \sum_{t=1}^{T} \|H u_t - x_t\|^2$.

Thus, by (17), the total discounted cost of $U$ using policy
$\pi = (F, G)$ and discount factor $0 \le \gamma < 1$ is

(22)  $J^{\pi}(U) := c(U) + \gamma c(S_{\pi}U) + \gamma^2 c(S_{\pi}^2 U) + \ldots$

(23)  $= \sum_{k=0}^{\infty} \sum_{t=1}^{T} \gamma^k \|H s^{(k)}_{(F,G)}(u_t) - x_{t+k}\|^2$,

which is the same cost function as (9). This cost function reverts to
the cost function of Eq. (3) (= Eq. (21)) for $\gamma = 0$, but then
policy iteration reduces to the iterative 1-step method.

^4 Note that for policy improvement, it is not necessary to choose the
minimum in the equation.

3.2.3. Policy Evaluation. At iteration $i \ge 0$, we have a policy
$\pi_i = \pi_{(F_i,G_i)}$ to be evaluated. The evaluation starts by
taking many steps using $\pi_i$, which (by (11)) guarantees that the
agent is in state $U_{(F_i,G_i)}$. Therefore we have to evaluate the
cost function in a single state $U_{(F_i,G_i)}$ only, which, in
contrast to traditional policy iteration, can be done directly.

3.2.4. Policy Improvement. Policy improvement will be accomplished only
for the single state $U_{(F_i,G_i)}$, because this is the only state
where the full cost is known. As a further deviation from the basic
policy improvement strategy (20), we take only a small step towards

(24)  $\arg\min_{(F,G)} \left( c(U_{(F_i,G_i)}) + \gamma J^{\pi_i}(S_{(F,G)} U_{(F_i,G_i)}) \right) =$

(25)  $\arg\min_{(F,G)} J'(F,G)$,

and improve the policy gradually:

(26)  $\pi_{i+1}(U_{(F_i,G_i)}) = (F_{i+1}, G_{i+1}) := (F_i, G_i) - \alpha_i \nabla J'(F_i, G_i)$,

where $\alpha_i$ is the learning rate for iteration $i$, and $\nabla$
denotes the gradient with respect to components of $(F, G)$.

Iterating these two steps completes policy iteration, and as can be
easily seen, it is identical to the PIRANHA algorithm described in
Fig. 2.
3.2.5. Differences from Standard Policy Iteration. There are several
important differences between the standard policy iteration algorithm
and PIRANHA. Firstly, in the policy improvement step we modified the
policy so that only the cost of a single state, $U_i$, is considered.
However, changing the policy in $U_i$ changes the costs of every other
state as well, and we have no guarantee that the change is really an
improvement for all states (in fact, it can be shown that there will be
states for which the cost increases). Thus, we cannot apply any of the
convergence results of (approximate) policy iteration algorithms,
because they all require improvement over the whole state space.
Fortunately, $J^{\pi_i}(U_i) \ge J^{\pi_{i+1}}(U_{i+1})$ still holds,
but we need to prove this by taking into account the specific structure
of the cost function.



4. Analysis 

4.1. The Finite-K Approximation. The gradients of (12) are infinite
sums, which are approximated by finite sums up to the $K$-th term. The
first question we have to answer is whether this approximation is
feasible. The following lemma shows that for a sufficiently small
discount factor $\gamma$, the gradients (13) and (14) are convergent,
therefore for sufficiently large $K$, the gradients of (12) will be
approximated well by (16).

Lemma 1. Suppose that (a) for all $i$,
$\|F_i\|_\infty \le 1/\sqrt{\gamma}$, (b) the slope of the sigmoid
function is at most 1, i.e. $\sup_z \sigma'(z) \le 1$. Then for any
$\varepsilon_0 > 0$ there exists a $K$ so that if (16) is used for
calculating the gradient, the difference is less than $\varepsilon_0$.

Proof. The difference of derivatives with respect to some component
$F_{ab}$ is

(27)-(29)  $\left| \frac{\partial J'}{\partial F_{ab}}(F_i, G_i) - \frac{\partial \hat{J}}{\partial F_{ab}}(F_i, G_i) \right| \le \sum_{k=K+1}^{\infty} \sum_{t=1}^{T} \gamma^k \|e_{t,k}\|_\infty \|H\|_\infty \|m(k)\|_\infty \|u_t\|_\infty$.

But $u_t$ is an inner state of a neural network, so
$\|u_t\|_\infty \le 1$, and similarly, $\|e_{t,k}\|_\infty \le 2$, and
$\|H\|_\infty = 1$ because it is a projection matrix. Using (15) and
Assumption (b), the norm of $m(k)$ can be bounded from above by

(30)  $\|m(k)\|_\infty \le \|F_i\|_\infty^{k} \le \gamma^{-k/2}$,

so the difference of the exact and approximate derivatives can be
bounded as

(31)  $\sum_{k=K+1}^{\infty} \sum_{t=1}^{T} \gamma^k \cdot 2 \cdot 1 \cdot \gamma^{-k/2} \cdot 1 = \sum_{k=K+1}^{\infty} 2T \gamma^{k/2} = \frac{2T \gamma^{(K+1)/2}}{1 - \gamma^{1/2}}$

(32)  $= C_0 \gamma^{K/2}$,

with $C_0 = 2T \gamma^{1/2} / (1 - \gamma^{1/2})$, which can be made
less than $\varepsilon_0$ for sufficiently large $K$. One may use (14)
to show that the same bound applies for the differences of derivatives
with respect to elements of $G$. □
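The geometric tail in Eqs. (31)-(32) is easy to evaluate numerically, which gives a feel for how large $K$ has to be. The helper names and the sample values of $T$, $\gamma$ and $\varepsilon_0$ below are assumptions chosen for illustration.

```python
import math

def tail_bound(T, gamma, K):
    """Closed form of sum_{k=K+1}^inf 2*T*gamma**(k/2), as in Eqs. (31)-(32)."""
    root = math.sqrt(gamma)
    return 2 * T * root ** (K + 1) / (1 - root)

def smallest_lookahead(T, gamma, eps0):
    """Smallest K whose tail bound is below eps0 (simple linear search)."""
    K = 0
    while tail_bound(T, gamma, K) >= eps0:
        K += 1
    return K

# Hypothetical values: T = 10 time steps, gamma = 0.25, tolerance 1e-3.
bound_at_3 = tail_bound(10, 0.25, 3)
K_needed = smallest_lookahead(10, 0.25, 1e-3)
```

Since the bound shrinks by a factor of $\sqrt{\gamma}$ per additional lookahead step, the required $K$ grows only logarithmically in $1/\varepsilon_0$.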

4.2. Convergence proof. We should note that proving the convergence of
the algorithm of Fig. 2 is easy, because it is a gradient descent
method, therefore, as is well known, it converges to a (local) minimum
if the step sizes are sufficiently small.

However, the algorithm offers new possibilities within the framework of
reinforcement learning, as we shall discuss later. We give an
alternative proof that is somewhat more complicated, but exploits the
policy iteration reformulation. This derivation reflects the mechanism
and the potentials of the algorithm better.



5. The proof of the Convergence Theorem 

First, suppose that J and J' can be computed exactly. We proceed 
by a series of lemmas: 

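Let $\pi_i = (F_i, G_i)$ be the policy at iteration $i$,
$U_{\pi_i} = (u_0, u_1, \ldots, u_T)$, furthermore, let the negative
gradient of $J'$ be denoted by $\Delta\pi = (\Delta F, \Delta G)$ as
above.

Lemma 2. For $\Delta\pi \ne 0$ there exists a number $\alpha_0 > 0$
such that for all $0 < \alpha < \alpha_0$,

(33)  $J^{\pi_i + \alpha\Delta\pi}(U_{\pi_i}) < J^{\pi_i}(U_{\pi_i})$.

Proof. By definition, $\Delta\pi$ is the steepest descent direction of
$c(U_{\pi_i}) + \gamma J^{\pi_i}(S_{\pi}U_{\pi_i})$, and it is nonzero,
therefore for sufficiently small $\beta_0$, for all
$0 < \beta < \beta_0$,

(34)  $c(U_{\pi_i}) + \gamma J^{\pi_i}(S_{\pi_i + \beta\Delta\pi}U_{\pi_i})$

(35)  $< c(U_{\pi_i}) + \gamma J^{\pi_i}(S_{\pi_i}U_{\pi_i}) = J^{\pi_i}(U_{\pi_i})$.

Let

$\mathcal{U} := \{ U \in \mathbb{R}^{n \times (T+1)} \mid c(U) + \gamma J^{\pi_i}(S_{\pi_i + \beta\Delta\pi}U) < J^{\pi_i}(U)$ for all $\beta < \beta_0 \}$

be the set of states for which $\pi_i + \beta_0\Delta\pi$ is an
improvement over $\pi_i$. This is an open set, because
$c(U) + \gamma J^{\pi_i}(S_{\pi_i + \beta_0\Delta\pi}U) - J^{\pi_i}(U)$
is a continuous function of $U$. Furthermore,
$U_{\pi_i} \in \mathcal{U}$, so it also contains a ball with some
positive radius $r$ centered on $U_{\pi_i}$.

Note that $U_{\pi_i} = S_{\pi_i}U_{\pi_i}$ and the operator
$S_{\pi_i + \beta\Delta\pi}$ is continuous in $\beta$, so there exists
a $\beta_1$ so that
$\|U_{\pi_i} - S_{\pi_i + \beta_1\Delta\pi}U_{\pi_i}\| < r$, which
means that $S_{\pi_i + \beta_1\Delta\pi}U_{\pi_i} \in \mathcal{U}$, and
in general, for all $k$ there exists a $\beta_k$ so that
$S^k_{\pi_i + \beta_k\Delta\pi}U_{\pi_i} \in \mathcal{U}$. Let

(36)  $\alpha_0 := \min(\beta_0, \beta_1, \ldots, \beta_{T-1})$.

Then for all $0 \le k < T$ and $0 < \alpha < \alpha_0$

(37)  $c(S^k_{\pi_i + \alpha\Delta\pi}U_{\pi_i}) + \gamma J^{\pi_i}(S^{k+1}_{\pi_i + \alpha\Delta\pi}U_{\pi_i}) < J^{\pi_i}(S^k_{\pi_i + \alpha\Delta\pi}U_{\pi_i})$

by the definition of $\mathcal{U}$. Adding these inequalities with
weights $\gamma^k$ gives

(38)  $c(U_{\pi_i}) + \gamma c(S_{\pi_i + \alpha\Delta\pi}U_{\pi_i}) + \ldots + \gamma^{T-1} c(S^{T-1}_{\pi_i + \alpha\Delta\pi}U_{\pi_i}) +$

(39)  $\gamma^{T} J^{\pi_i}(S^{T}_{\pi_i + \alpha\Delta\pi}U_{\pi_i}) < J^{\pi_i}(U_{\pi_i})$.

The LHS is exactly the cost of taking $T$ steps using
$\pi_i + \alpha\Delta\pi$, and then using $\pi_i$. Noting that this is
the same as following $\pi_i + \alpha\Delta\pi$ all the way, we get

(40)  $J^{\pi_i + \alpha\Delta\pi}(U_{\pi_i}) < J^{\pi_i}(U_{\pi_i})$,

which was to be shown. □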

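Lemma 3. For $\Delta\pi \ne 0$ there exists a number $\alpha_1 > 0$
such that for all $0 < \alpha < \alpha_1$,

(41)  $J^{\pi_i + \alpha\Delta\pi}(U_{\pi_i + \alpha\Delta\pi}) < J^{\pi_i}(U_{\pi_i})$.

Proof. The proof proceeds similarly to the proof of Lemma 2. Let

(42)  $\mathcal{W} := \{ U \in \mathbb{R}^{n \times (T+1)} \mid J^{\pi_i + \alpha\Delta\pi}(U) < J^{\pi_i}(U_{\pi_i})$ for all $\alpha < \alpha_0 \}$.

This set is open as well because of the continuity of $J$, and by
Lemma 2 it contains $U_{\pi_i}$, and thus contains also a ball centered
around $U_{\pi_i}$. This means that for sufficiently small $\beta_0$,
$U_{\pi_i + \beta\Delta\pi} \in \mathcal{W}$ for all
$0 < \beta < \beta_0$. Setting $\alpha_1 := \min(\alpha_0, \beta_0)$
proves the statement of the lemma. □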

These lemmas enable us to prove convergence for the exact cost 
function case: 
Theorem 4. If the learning rates $\alpha_i$ are determined by Lemma 3,
PIRANHA converges to a policy $\hat{\pi}$ that is a (local) minimum of
$J^{\pi}(U_{\pi})$.

Proof. As Lemma 3 states, $J^{\pi_i}(U_{\pi_i})$ is monotonically
decreasing, thus it is necessarily convergent, which also means that
the policy series $\pi_i$ converges to some
$\hat{\pi} = \lim_{i \to \infty} \pi_i$. By taking the limit of
$\pi_{i+1} = \pi_i + \alpha_i \Delta\pi$ we get by applying Lemma 3 to
$\hat{\pi}$ that $\alpha \Delta\hat{\pi} = 0$. This means that either
$\Delta\hat{\pi} = 0$ or $\alpha = 0$. However, if
$\Delta\hat{\pi} \ne 0$, Lemmas 2 and 3 guarantee the existence of a
positive learning rate $\alpha$, so the gradient necessarily vanishes. □

Now suppose that we do not compute the exact $J'$ for computing the
gradient, but use the approximation

(43)  $\hat{J}(F,G) := \sum_{t=1}^{T} \sum_{k=1}^{K} \gamma^k \|e_{t,k}\|^2$

instead with some fixed lookahead parameter $K$. The following theorem
is a rephrasing of Lemma 1 and shows that we can get arbitrarily close
to the minimum if $K$ is sufficiently large and the norm of recurrent
weights does not grow much beyond 1.

To this end, let us first define the norm of a policy $\pi$: recall
that $\pi$ is the concatenation of an $n \times n$ and an
$n \times (m+1)$ real-valued matrix. If we regard policies as
$n(n + m + 1)$ dimensional vectors, then we can define the scalar
product $\langle \pi_1, \pi_2 \rangle$ of two policies as the usual
scalar product in Euclidean space, and the norm of policy $\pi$ as
$\|\pi\| = \langle \pi, \pi \rangle^{1/2}$. The norm of matrices and
vectors is defined as usual: $\|\cdot\|_\infty$ denotes the
maximum-norm.

Theorem 5. Suppose that (a) for all $i$,
$\|F_i\|_\infty \le 1/\sqrt{\gamma}$, (b) the slope of the sigmoid
function is at most 1, i.e. $\sup_z \sigma'(z) \le 1$. Then for any
$\varepsilon_0 > 0$ there exists a $K$ so that if (43) is used for
calculating the approximate gradient $\hat{\Delta}\pi_i$, then
$\limsup_{i \to \infty} \|\Delta\pi_i\| \le \varepsilon_0$.

Proof. For the proof we will exploit the fact that gradient descent is
convergent when the direction of the steepest descent is only
approximated, as long as the actual direction of descent and the
direction of steepest descent enclose an acute angle, i.e. as long as
their scalar product is positive. In our case this means that we have
to prove that at each iteration,
$\langle \Delta\pi, \hat{\Delta}\pi \rangle > 0$. We begin by bounding
$\|\hat{\Delta}\pi - \Delta\pi\|$.

Lemma 1 states that the difference of the exact and approximate
derivatives can be bounded by

(44)  $\left| \frac{\partial J'}{\partial F_{ab}}(F_i, G_i) - \frac{\partial \hat{J}}{\partial F_{ab}}(F_i, G_i) \right| \le \sum_{k=K+1}^{\infty} 2T \gamma^{k/2}$

(45)  $\le \frac{2T \gamma^{1/2}}{1 - \gamma^{1/2}} \, \gamma^{K/2} = C_0 \gamma^{K/2}$,

where $C_0$ is a constant independent of $K$, and the same holds for
the differences of derivatives with respect to elements of $G$.

Therefore the norm of the gradient error
$\hat{\Delta}\pi - \Delta\pi$ can be bounded from above as

(46)  $\|\hat{\Delta}\pi - \Delta\pi\| \le \sqrt{n(n + m + 1)} \, C_0 \gamma^{K/2} := C_1 \gamma^{K/2}$.

This can be made smaller than any fixed $\varepsilon_0 > 0$ if
$K > 2 \log(\varepsilon_0 / C_1) / \log \gamma$. Applying this result
to the scalar product yields

(47)  $\langle \Delta\pi, \hat{\Delta}\pi \rangle = \langle \Delta\pi, \Delta\pi \rangle + \langle \Delta\pi, \hat{\Delta}\pi - \Delta\pi \rangle$

(48)  $\ge \|\Delta\pi\|^2 - \|\Delta\pi\| \, \|\hat{\Delta}\pi - \Delta\pi\|$

(49)  $\ge \|\Delta\pi\| (\|\Delta\pi\| - \varepsilon_0)$,

which is greater than zero if $\|\Delta\pi\| > \varepsilon_0$, but this
holds by the assumption of the theorem. □
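The chain (47)-(49) is just the Cauchy-Schwarz inequality applied to the gradient error, and it can be checked on random vectors. The function name `descent_margin`, the vector length and the choice of an error half as long as the exact direction are assumptions for illustration.

```python
import numpy as np

def descent_margin(delta_exact, delta_approx):
    """Return the scalar product of Eq. (47) and the lower bound of Eq. (48)."""
    inner = float(np.dot(delta_exact, delta_approx))
    norm = float(np.linalg.norm(delta_exact))
    err = float(np.linalg.norm(delta_approx - delta_exact))
    return inner, norm * (norm - err)

# Hypothetical exact direction and a perturbation with half its length.
rng = np.random.default_rng(1)
delta = rng.standard_normal(12)
noise = rng.standard_normal(12)
noise *= 0.5 * np.linalg.norm(delta) / np.linalg.norm(noise)
inner, lower = descent_margin(delta, delta + noise)
```

As long as the error norm stays below the norm of the exact direction, the lower bound is positive, so the approximate direction still encloses an acute angle with the exact one.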

Remark. Assumption (a) requires that the recurrent weights remain
relatively small (the norm of $F$ does not grow much beyond 1), thus
avoiding chaotic behavior. Although this constraint cannot be verified
a priori, it can be enforced by either choosing small $\gamma$ values
or by applying a weight regularization technique.
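One simple way to keep Assumption (a) satisfied in practice is to rescale $F$ after each update. This hard rescaling is a swapped-in technique shown only for illustration; the Remark itself names small $\gamma$ values or weight regularization, and the function name and test matrices below are hypothetical.

```python
import numpy as np

def rescale_recurrent(F, gamma):
    """Shrink F, if needed, so that ||F||_inf <= 1 / sqrt(gamma)."""
    limit = 1.0 / np.sqrt(gamma)
    norm = np.linalg.norm(F, ord=np.inf)      # maximum absolute row sum
    return F if norm <= limit else F * (limit / norm)

# Hypothetical matrices: one violating the bound, one already inside it.
F_big = 10.0 * np.ones((3, 3))
F_small = 0.1 * np.eye(3)
F_clipped = rescale_recurrent(F_big, gamma=0.25)
F_kept = rescale_recurrent(F_small, gamma=0.25)
```

With `gamma = 0.25` the limit is 2, so `F_big` is scaled down while `F_small` is returned unchanged.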

6. Discussion 

6.1. A Hierarchy of Reinforcement Learners? In many reinforce- 
ment learning problems, the state space is prohibitively large or con- 
tinuous, so storing all the state values is infeasible. In most cases this 
problem is resolved by applying some kind of function approximation. 
It is well known that neural networks have excellent function approx- 
imation capabilities, so it is no wonder that they have been widely 
applied in various RL problems. These neural networks are all feed- 
forward ones, which is completely satisfactory for estimating the value 
function of Markovian problems, as there is no need to keep past mem- 
ories. 

However, the application of recurrent neural networks can also be
justified, if instead of value function approximation they are used for
identifying abstract states of the problem: they are able to identify
spatio-temporal regions of the state space, not only spatial ones. This
means that they give a tool for handling non-Markovian, partially
observable problems. Furthermore, their prediction capability provides
a model of the environment as well.

As a consequence, RNNs with PIRANHA fit very naturally into a
hierarchical RL scheme: The states of the lower level are the raw input
(which is continuous and non-Markovian). The lower-level agent uses the
reconstruction/prediction error as reward signal and is able to learn
identifying some region of the input, providing a high-level state
description. This state description can then be used by a traditional
RL agent on the upper level. Working out the details of this
hierarchical structure is the topic of ongoing research, see, e.g.,
(Szita et al., 2002; Lorincz et al., 2003; Szita et al., 2003; Bakker
and Schmidhuber, 2004a; Bakker and Schmidhuber, 2004b) and references
therein.



6.2. Summary. In this article, a solution to the sequence replay
problem was proposed. The question was how to represent time series
data so that it can be reproduced sequentially from seeing only a small
portion of it. We formalized the task as the minimization of the cost
function of a recurrent neural network. An analogy to reinforcement
learning enabled us to adapt the policy iteration method to our
problem, yielding a novel algorithm that we called PIRANHA. We showed
that PIRANHA is convergent, and fits naturally into the RL framework,
yielding a hierarchical architecture.